Scraping examples

Some BeautifulSoup examples to help you scrape the HTML from web pages.

See the BeautifulSoup documentation for a technical reference.

1. Import libraries



In [ ]:

    
# Import the libraries used below.
# The "lxml" parser passed to BeautifulSoup requires the lxml package to be installed.
import urllib.request
from bs4 import BeautifulSoup

2. Define scraping functions



In [ ]:

    
# Scrape all HTML from webpage.
def scrapewebpage(url):
	# Open URL and get HTML.
	web = urllib.request.urlopen(url)

	# Make sure there weren't any errors opening the URL.
	if web.getcode() == 200:
		html = web.read()
		return html
	else:
		# Both values must be passed as one tuple to the % operator.
		print("Error %s reading %s" % (web.getcode(), url))

# Helper function that scrapes the webpage and turns it into soup.
def makesoup(url):
	html = scrapewebpage(url)
	return(BeautifulSoup(html, "lxml"))
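urllib.request.urlopen normally raises an exception for failed requests instead of returning a response with an error code, so the status check above may never reach its else branch. The sketch below shows one way to handle that. The names fetch_html and makesoup_or_none are hypothetical and not part of the original notebook, and "html.parser" is used here only so the sketch runs without lxml.

```python
import urllib.error
import urllib.request

from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Return the raw bytes at url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as web:
            return web.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        print("Error %s reading %s" % (err.code, url))
    except urllib.error.URLError as err:
        # The URL could not be opened at all (bad host, missing file, ...).
        print("Could not open %s: %s" % (url, err.reason))
    return None


def makesoup_or_none(url):
    """Hypothetical variant of makesoup() that returns None on failure."""
    html = fetch_html(url)
    if html is None:
        return None
    return BeautifulSoup(html, "html.parser")
```

With this variant, a caller can test the result for None before calling find() on it, instead of getting an error from BeautifulSoup when the download failed.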

3. BeautifulSoup examples

A) Find an `id`

Use find(id="name") to find the first HTML tag that has an id attribute like this: <h2 id="mp-itn-h2"></h2>



In [ ]:

    
# Scrape Wikipedia main page.
wp_soup = makesoup("https://en.wikipedia.org/wiki/Main_Page")



In [ ]:

    
# Match the <h2> tag with the id mp-itn-h2.
h2 = wp_soup.find(id="mp-itn-h2")

h2



In [ ]:

    
# Only get the text inside <h2>.
h2.get_text()
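The id on a live page can change or disappear, and find() returns None when nothing matches. A minimal offline sketch of that behaviour, using a made-up HTML snippet in place of the downloaded page (sample_html and the other names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
sample_html = '<div><h2 id="mp-itn-h2">In the news</h2><p>Some text</p></div>'
sample_soup = BeautifulSoup(sample_html, "html.parser")

found = sample_soup.find(id="mp-itn-h2")      # A Tag object.
missing = sample_soup.find(id="no-such-id")   # None, because no tag matches.

print(found.get_text())
print(missing)

# Check for None before calling get_text() to avoid an AttributeError.
title = missing.get_text() if missing is not None else ""
```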

B) Find a `class`

Use find("", "name") to find the first HTML tag that has a class attribute like this: <h2 class="name"></h2>



In [ ]:

    
# Scrape Wikipedia main page.
wp_soup = makesoup("https://en.wikipedia.org/wiki/Main_Page")



In [ ]:

    
# Find the first HTML tag that has class mw-headline.
headline = wp_soup.find("", "mw-headline")

headline



In [ ]:

    
# Only get the text inside the <span>.
headline.get_text()
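BeautifulSoup also accepts the class as a keyword argument, class_, which some readers may find easier to read than an empty tag name. A small sketch on a made-up snippet (the HTML and variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Made-up HTML: two different tags share the same class.
sample_html = (
    '<h2 class="mw-headline big">First</h2>'
    '<span class="mw-headline">Second</span>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# First tag of any kind that has the class mw-headline.
first_any = sample_soup.find(class_="mw-headline")

# First <span> that has the class mw-headline.
first_span = sample_soup.find("span", class_="mw-headline")

print(first_any.name, first_any.get_text())
print(first_span.name, first_span.get_text())
```

A tag with several classes, like the <h2> above, still matches when any one of its classes equals the name searched for.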

C) Find everything with a `class`

Use find_all("", "name") to find all HTML tags that have a class attribute like this: <h2 class="name"></h2>



In [ ]:

    
# Scrape Wikipedia main page.
wp_soup = makesoup("https://en.wikipedia.org/wiki/Main_Page")



In [ ]:

    
# Find all HTML tags that have class mw-headline.
all_headlines = wp_soup.find_all("", "mw-headline")

all_headlines



In [ ]:

    
# Now we have a list that we can go through with a for loop.
for headline in all_headlines:
    headline = headline.get_text()
    print(headline)
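The loop above can also be written as a list comprehension when the goal is a plain list of strings. The sketch below runs on a made-up snippet and also shows the limit argument of find_all(), which caps the number of matches (the HTML and names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Made-up HTML with two headlines and one unrelated paragraph.
sample_html = """
<h2><span class="mw-headline">Featured article</span></h2>
<h2><span class="mw-headline">In the news</span></h2>
<p class="other">Not a headline</p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Collect the text of every tag with class mw-headline.
headline_texts = [
    tag.get_text(strip=True)
    for tag in sample_soup.find_all(class_="mw-headline")
]

# Stop after the first match; the result is still a list.
first_only = sample_soup.find_all(class_="mw-headline", limit=1)

print(headline_texts)
```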

D) Find all `<h3>`

Use find_all("h3") to get all <h3> tags (or pass any other tag name).



In [ ]:

    
# Scrape Wikipedia main page.
wp_soup = makesoup("https://en.wikipedia.org/wiki/Main_Page")



In [ ]:

    
# Find all <h3> tags.
all_h3 = wp_soup.find_all("h3")

all_h3



In [ ]:

    
# Now we have a list that we can go through with a for loop.
for h3 in all_h3:
    h3 = h3.get_text()
    print(h3)
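find_all() also accepts a list of tag names, which may be handy when headings of several levels are wanted in document order. A short sketch on a made-up snippet (the HTML and names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Made-up HTML with headings of two levels.
sample_html = "<h2>Top</h2><h3>First</h3><p>text</p><h3>Second</h3>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Only the <h3> tags.
only_h3 = [tag.get_text() for tag in sample_soup.find_all("h3")]

# Both <h2> and <h3> tags, in the order they appear in the document.
h2_and_h3 = [
    (tag.name, tag.get_text())
    for tag in sample_soup.find_all(["h2", "h3"])
]

print(only_h3)
print(h2_and_h3)
```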

E) Find all column values in a table

Use a for loop to go through the rows of a <table> and extract the column values.



In [ ]:

    
# Scrape a Wikipedia page with a table.
champ_soup = makesoup("https://en.wikipedia.org/wiki/European_Road_Championships")



In [ ]:

    
# Find <table class="wikitable">.
table = champ_soup.find("table", "wikitable")

table



In [ ]:

    
# Go through each row and take the text from 1st and 2nd column.
rows = table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    if len(cols) > 1:                    # Skip rows without at least two <td> cells (e.g. header rows).
        year = cols[0].get_text()        # Get the text in the 1st column.
        country = cols[1].get_text()     # Get the text in the 2nd column.

        print(year + " " + country)
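Real tables often contain header rows made of <th> cells or rows with merged cells, so it helps to check the number of <td> cells before indexing. The sketch below collects the first two columns into a list of tuples. The table contents are placeholders, not data from the Wikipedia page, and the names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Placeholder table: one header row, two data rows, one merged row.
sample_html = """
<table class="wikitable">
  <tr><th>Year</th><th>Country</th></tr>
  <tr><td>2001</td><td>Country A</td></tr>
  <tr><td>2002</td><td>Country B</td></tr>
  <tr><td colspan="2">Note row</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", class_="wikitable")

records = []
for row in sample_table.find_all("tr"):
    cols = row.find_all("td")
    if len(cols) < 2:
        # Header rows (<th> only) and merged rows are skipped.
        continue
    # strip=True removes surrounding whitespace and newlines.
    records.append((cols[0].get_text(strip=True), cols[1].get_text(strip=True)))

print(records)
```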

F) Find a specific cell value in a table

Use this if you know the row and column of the cell in the <table> that you want to extract.



In [ ]:

    
table = champ_soup.find("table", "wikitable")

# Get cell value from row 5, column 1 (counting from 0, so the 6th <tr> and its 2nd <td>).
cell = table.find_all('tr')[5].find_all('td')[1].get_text()

cell
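Indexing a fixed row and column raises an IndexError when the table is shorter or narrower than expected. One possible guard is a small helper that returns None instead. get_cell is a hypothetical name and the table contents are placeholders.

```python
from bs4 import BeautifulSoup


def get_cell(table, row_index, col_index):
    """Return the text of one <td> cell, or None if it does not exist.

    Both indexes count from 0. Rows are <tr> tags, columns are <td> tags.
    """
    rows = table.find_all("tr")
    if row_index >= len(rows):
        return None
    cols = rows[row_index].find_all("td")
    if col_index >= len(cols):
        return None
    return cols[col_index].get_text(strip=True)


# Placeholder table: the header row has <th> cells only.
sample_html = """
<table class="wikitable">
  <tr><th>Year</th><th>Country</th></tr>
  <tr><td>2001</td><td>Country A</td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

print(get_cell(sample_table, 1, 1))
print(get_cell(sample_table, 5, 1))
```

Because the header row contains no <td> cells, asking for any column in row 0 of this sample also gives None.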

G) Find nested things

Use this if you want to find things that are nested inside each other.



In [ ]:

    
# Scrape Wikipedia main page.
wp_soup = makesoup("https://en.wikipedia.org/wiki/Main_Page")



In [ ]:

    
# Find <table id="mp-middle">.
middle_table = wp_soup.find("table", id="mp-middle")

# In the <table>, find <h2>.
h2 = middle_table.find("h2")

h2



In [ ]:

    
# Only get the text inside the <h2>.
h2.get_text()
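Calling find() on a tag searches only inside that tag, which is what makes the nested lookup work. The sketch below shows the difference on a made-up snippet and guards against the outer tag being absent (the HTML and names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Made-up HTML: one <h2> outside the table and one inside it.
sample_html = """
<h2>Outside</h2>
<table id="mp-middle"><tr><td><h2>Inside</h2></td></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Searching the whole document returns the first <h2> anywhere.
first_h2 = sample_soup.find("h2")

# Searching inside the table returns only the nested <h2>.
middle = sample_soup.find("table", id="mp-middle")
inner_h2 = middle.find("h2") if middle is not None else None

print(first_h2.get_text())
print(inner_h2.get_text())
```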